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Abstract 

We consider the problem of selecting the "best" subset of exactly k columns from an 
m X n matrix A. In particular, we present and analyze a novel two-stage algorithm that runs 
in 0{mm{mn^ ,m^n}) time and returns as output an m x fc matrix C consisting of exactly 
k columns of A. In the first stage (the randomized stage), the algorithm randomly selects 
Q{k\ogk) columns according to a judiciously-chosen probability distribution that depends on 
information in the top-fc right singular subspace of A. In the second stage (the deterministic 
stage), the algorithm applies a deterministic column-selection procedure to select and return 
exactly k columns from the set of columns selected in the first stage. Let C be the m x k 
matrix containing those k columns, let Pc denote the projection matrix onto the span of those 
columns, and let Ak denote the "best" rank-fc approximation to the matrix A as computed 
with the singular value decomposition. Then, we prove that, with probability at least 0.8, 

\\A~PcA\\p < e(fclogi/'fc) \\A~Ak\\p. 

This Frobenius norm bound is only a factor of ^/k]ogk worse than the best previously existing 
existential result and is roughly 0{Vk\) better than the best previous algorithmic result (both 
of Deshpande et al. [TT]) for the Frobenius norm version of this Column Subset Selection 
Problem. We also prove that, with probability at least 0.8, 

||A-PcA||2<e(fclogi/2fc) \\A-Ak\\^ + e(k^/Hog^^H) \\A-Ak\\p. 

This spectral norm bound is not directly comparable to the best previously existing bounds for 
the spectral norm version of this Column Subset Selection Problem (such as the ones derived 
by Gu and Eisenstat in [33]). More specifically, our bound depends on ||A — whereas 
previous results depend on — k \\A — Ak\\2] if these two quantities are comparable, then 
our bound is asymptotically worse by a (fclogfc)^^^ factor. 

1 Introduction 

We consider the problem of selecting the "best" set of exactly k columns from an m x n matrix 
A. More precisely, we consider the fohowing Column Subset Selection Problem (CSSP): 

*A conference proceedings version of this paper appeared in [4j. This manuscript presents a modified sampling- 
based algorithm that fixes a bug in Lemma 4.4 of [4] . This bug also affects the spectral norm bound for the CSSP 
that was reported in [4]; the Frobenius norm bound remains unaffected. 
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Definition 1 (The CSSP) Given a matrix A G '^i^^^ and a positive integer k, pick k columns 
of A forming a matrix C G JR^x*^ such that the residual 

\\A-PcA\\^ 

is minimized over all possible (^) choices for the matrix C. Here, Pc = CC^ denotes the 
projection onto the k-dimensional space spanned by the columns of C and (, = 2 or F denotes the 
spectral norm or Frobenius norm. 

That is, the goal of the CSSP is to find a subset of exactly k columns of A that "captures" as 
much of A as possible, with respect to the spectral norm and/or FVobenius norm, in a projection 
sense. The CSSP has been studied extensively in numerical linear algebra, where it has found 
applications in, e.g., scientific computing [6]. More recently, a relaxation has been studied in 
theoretical computer science, where it has been motivated by applications to large scientific and 
internet data sets [16] . 

1.1 Complexity of the CSSP 

We briefly comment on the complexity of the problem. Clearly, in 0{n^) time we can generate 
all possible matrices C and thus find the optimal solution in 0{n^mnk) time. However, from a 
practical perspective, in data analysis applications of the CSSP (see Section [l.2p . n is often in 
the order of hundreds or thousands. Thus, in practice, algorithms whose running time depends 
exponentially on k are prohibitively slow even if k is, from a theoretical perspective, a constant. 
Finally, the NP-hardness of the CSSP (assuming is a function of n) is an open problem. Note, 
though, that a similar problem, asking for the k columns of the m x n matrix A that maximize 
the volume of the parallelepiped spanned by the columns of C, is provably NP-hard p^ . 

1.2 The CSSP in statistical data analysis 

In data applications, where the input matrix A models m objects represented with respect to n 
features, the CSSP corresponds to unsupervised feature selection. Standard motivations for fea- 
ture selection include facilitating data visualization, reducing training times, avoiding overfitting, 
and facilitating data understanding. 

Consider, in particular. Principal Components Analysis (PC A), which is the predominant 
linear dimensionality reduction technique, and which has been widely applied on datasets in 
all scientific domains, from the social sciences and economics, to biology and chemistry. In 
words, PCA seeks to map or embed data points from a high dimensional Euclidean space to 
a low dimensional Euclidean space while keeping all the relevant linear structure intact. PCA 
is an unsupervised dimensionality reduction technique, with the sole input parameters being 
the coordinates of the data points and the number of dimensions that will be retained in the 
embedding (say k), which is typically a constant independent of m and n; often it is /c ^ 
{m, n} too. Data analysts often seek a subset of k actual features (that is, k actual columns, as 
opposed to the k eigenvectors or eigenfeatures returned by PCA) that can accurately reproduce 
the structure derived by PCA. The CSSP is the obvious optimization problem associated with 
such unsupervised feature selection tasks. 

We should note that similar formulations appeared in [25l [Ml EHl HQI [SHI [l]. In addition, 
applications of such ideas include: (i) [37], where a "compact CUR matrix decomposition" was 
applied to static and dynamic data analysis in large sparse graphs; (ii) [26l |27l [I4j , where these 
ideas were used for compression and classification of hyperspectral medical data and the re- 
construction of missing entries from recommendation systems data in order to make high-quality 
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recommendations; and (iii) [33], where the concept of "PCA-correlated SNPs" (Single Nucleotide 
Polymorphisms) was introduced and applied to classify individuals from throughout the world 
without the need for any prior ancestry information. See [3] for a detailed evaluation of our main 
algorithm as an unsupervised feature selection strategy in three application domains of modern 
statistical data analysis (finance, document-term data, and genetics). 

1.3 Our main results 

We present a novel two-stage algorithm for the CSSP. This algorithm is presented in detail in 
Section [3] as Algorithm [TJ In the first stage of this algorithm (the randomized stage), we randomly 
select Q{klogk) columns of , i.e., of the transpose of the n x k matrix consisting of the top 
k right singular vectors of A, according to a judiciously-chosen probability distribution that 
depends on information in the top-fc right singular subspace of A. Then, in the second stage 
(the deterministic stage), we apply a deterministic column-selection procedure to select exactly 
k columns from the set of columns of Vj^ selected by the first stage. The algorithm then returns 
the corresponding k columns of A. In Section [4] we prove the following theorem. 

Theorem 1 There exists an algorithm (the two-stage AlgorithmUl) that approximates the solution 
to the CSSP. This algorithm takes as input an m x n matrix A and a positive integer k; it runs 
in 0{mm{mn'^,m?n}) time; and it returns as output an m x k matrix C consisting of exactly k 
columns of A such that with probability at least O.S.- 



Here, Pc = CC^ denotes a projection onto the column span of the matrix C , and A^ denotes the 
best rank-k approximation to the matrix A as computed with the singular value decomposition. 

Note that we can trivially boost the success probability in the above theorem to 1 — (5 by repeating 
the algorithm O (log (1/(5)) times. Note also that the running time of our algorithm is linear in 
the larger of the dimensions m and n, quadratic in the smaller one, and independent of k. Thus, 
it is practically useful and efficient. 

To put our results into perspective, we compare them to the best existing results for the CSSP. 
Prior work provided bounds of the form 



where p{k,n) is a polynomial on n and k. For ^ = 2, i.e., for the spectral norm, the best 



previously-known bound for approximating the CSSP is p{k, n) = ( -s/k^n — A:) -|- 1 1 [23], while 



for ^ = F, i.e., for the Frobenius norm, the best bound is p{k,n) = {k + 1)! [H]. Both results 
are algorithmically efficient, running in time polynomial in all three parameters m, n, and k. 
The former runs in 0{mnklogn) time and the latter runs in 0{mnk + kn) time. Our approach 
provides an algorithmic bound for the Frobenius norm version of the CSSP that is roughly 0(\/fcI) 
better than the best previously- known algorithmic result. It should be noted that [11] also proves 
that by exhaustively testing all (^) possibilities for the matrix C, the best one will satisfy eqn. 
([TJ with p{k,n) = \/k + 1. Our algorithmic result is only 0{\/k log k) worse than this existential 
result. A similar existential result for the spectral norm version of the CSSP is proved in [23] with 

p{k,n) = y/k{n — k) + 1. Our spectral norm bound depends on G ( A;^/^ log^^'^ /c ) — In 




A-PcA\\^ < p{k,n)\\A-Ak 




(1) 
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a worst case setting (e.g., when all the bottom n — k + 1 singular values of A are equal) this 



Finally, we should emphasize that a novel feature of the algorithm that we present in this 
paper is that it combines in a nontrivial manner recent algorithmic developments in the theoretical 
computer science community with more traditional techniques from the numerical linear algebra 
community in order to obtain improved bounds for the CSSP. 

2 Background and prior work 
2.1 Notation and linear algebra 

First, recall that the 0-notation can be used to denote an asymptotically tight bound: /(n) £ 
@{g{n)) or /(n) = Q{g{n)) if there exist positive constants ci, C2, and no such that < cig{n) < 
f{n) < C2g{n) for all n > hq. This is similar to the way in which the big-O-notation can be used 
to denote an asymptotic upper bound: /(n) = 0{g{n)) if there exist positive constants c and no 
such that < /(n) < cg{n) for all n > no. 

Let [n] denote the set {1,2, . . . ,n}. For any matrix A G R"^^", let G [m] denote the 

i-th row of ^ as a row vector, and let A^^\j G [n] denote the j-th column of ^ as a column 
vector. In addition, let ||A||^ = J2i j ^ij denote the square of its Frobenius norm, and let 
IIAII2 = sup^gjjn^ ^-^0 I^^l2 / I^l2 denote its spectral norm. If A E M™^", then the Singular Value 
Decomposition (SVD) of A can be written as 



In this expression, p < min{m, n} denotes the rank of A, Ua G M™^^ is an orthonormal matrix, 
Tja is a /) X /) diagonal matrix, and Va £ M"^^ is an orthonormal matrix. Also, S^, denotes the kxk 
diagonal matrix containing the top k singular values of A, Sp_fc denotes the [p — k) x {p — k) 
matrix containing the bottom p — k singular values of A, denotes the n x k matrix whose 
columns are the top k right singular vectors of A, and Vp-k denotes the nx (p — k) matrix whose 
columns are the bottom p — k right singular vectors of A, etc. 

The mxk orthogonal matrix Uk consisting of the top k left singular vectors of A is the "best" 
set of k linear combinations of the columns of A, in the sense that A^ = Puk^ — Uk^kyk" 
"best" rank k approximation to A. Here, Pfj^. = UkU'^ is a projection onto the A:-dimensional space 
spanned by the columns of Uk- In particular, A^ minimizes ||^ — A'\\^, for both ^ = 2 and F, 
over all m X n matrices A' whose rank is at most k. We also denote Ap^k = U p-kT^ p-kVj_j,. We 
will use the notation ||-||^ when writing an expression that holds for both the spectral and the 
Frobenius norm. We will subscript the norm by 2 and F when writing expressions that hold for 
one norm or the other. Finally, the Moore-Penrose generalized inverse, or pseudoinverse, of A, 
denoted by j4+, may be expressed in terms of the SVD as = V^iS^^C/J. 

Finally, we will make frequent use of the following fundamental result from probability theory, 
known as Markov's inequality |30j . Let X be a random variable assuming non- negative values 
with expectation E[X]. Then, for all t > 0, X < t • E [X] with probability at least 1 — . 
We will also need the so-called union bound. Given a set of probabilistic events £i,£2-, ■ ■ ■ -.Sn 
holding with respective probabilities pi,P2, ■ ■ ■ ,Pm the probability that all events hold (a.k.a., the 
probability of the union of those events) is upper bounded by 




A 



Ua^aVJ 
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2.2 Related prior work 

Since solving the CSSP exactly is a hard combinatorial optimization problem, research has histor- 
ically focused on computing approximate solutions to it. Since ||A — provides an immediate 
lower bound for ||^ — -Pc^llg; for ^ = 2, F and for any choice of C, a large number of approxima- 
tion algorithms have been proposed to select a subset of k columns of A such that the resulting 
matrix C satisfies 

\\A-Ak\\^ < \\A - PcA\\^ < p{k,n) \\A - AkW^ 

for some function p{k,n). Within the numerical linear algebra community, most of the work on 
the CSSP has focused on spectral norm bounds and is related to the so-called Rank Revealing 
QR (RRQR) factorization: 

Definition 2 (The RRQR factorization) Given a matrix A £ (m > n) and an integer 

k (k < n), assume partial QR factorizations of the form: 

where Q G ^™x" ig orthonormal matrix, R G Jl"-^"- is upper triangular, Rn G R^^^ , R12 G 
j^kx{n-k) ^ i?22 G R['"--k)x{n-k) ^ 11 G i?"^" is a permutation matrix. The above factorization 
is called a RRQR factorization if it satisfies 



pi{k,n) 

CTk+liA) < (Jmax{R22) < P2 (A:, n)(Tfc+l ( A) , 

where pi{k,n) and p2{k,n) are functions bounded by low degree polynomials in k and n. 

The work of Golub on pivoted QR factorizations [21] was followed by much research addressing 
the problem of constructing an efficient RRQR factorization. Most researchers improved RRQR 
factorizations by focusing on improving the functions pi{k,n) and p2{k,n) in Definition [2J Let 
Ilfc denote the first k columns of a permutation matrix H. Then, if C = AHi^. is an m x A; matrix 
consisting of k columns of A, it is straightforward to prove that 

M ~ -fc^ll^ = 11-^2211^ > 

for both = 2,F. Thus, in particular, when applied to the spectral norm, it follows that 
\\A-PcA\\2 <P2{k,n)ak+i{A) = p2{k,n) \\A - Ak\\2 , 

i.e., any algorithm that constructs an RRQR factorization of the matrix A with provable guaran- 
tees also provides provable guarantees for the CSSP. See Table[T]for a summary of existing results, 
and see |19] for a survey and an empirical evaluation of some of these algorithms. More recently, 
[291 [39] proposed random-projection type algorithms that achieve the same spectral norm bounds 
as prior work while improving the running time. 

Within the theoretical computer science community, much work has followed that of Frieze, 
Kannan, and Vempala |20j on selecting a small subset of representative columns of A, forming a 
matrix C, such that the projection of A on the subspace spanned by the columns of C is as close to 
A as possible. The algorithms from this community are randomized, which means that they come 
with a failure probability, and focus mainly on the Frobenius norm. It is worth noting that they 
provide a strong tradeoff between the number of selected columns and the desired approximation 



5 



Method 


Reference 


p(k,n) 


Time 


Pivoted QR 


[Golub, 1965] 


[21] 


Y^(n - k)2'' 


0{mnk) 


High RRQR 


[Foster, 1986] 


[18] 


^n{n - k)!"-" 


0{mn'^) 


High RRQR 


[Chan, 1987] 


[5] 


Vn(n - A:)2"-^ 


0{mn^) 


RRQR 


[Hong and Pan, 1992] 


[24] 


y/k{n -k) + k 


- 


Low RRQR 


[Chan and Hansen, 1994] 


m 


y/{k + l)n2'=+^ 


0{mn^) 


Hybrid-I RRQR 


[Chandr. and Ipsen, 1994] 


M 

t — ! 


y^{k + l){n-k) 


- 


Hybrid-H RRQR 




M 


y/{k + l){n-k) 


- 


Hybrid-HI RRQR 




[8] 


V{k + l){n-k) 


- 


Algorithm 3 


[Gu and Eisenstat, 1996] 

L ' J 


[23] 


\jk{n - A;) + 1 


- 


Algorithm 4 




[23] 


\/pk{n - A:) + 1 


0{kmn log f{n)) 


DGEQPY 


[Bischof and Orti, 1998] 


[2J 


0{^{k + \f{n-k)) 


- 


DGEQPX 




[2] 


oQ{k + \){n-k)) 


- 


SPQR 


[Stewart, 1999] 


[35] 


- 


- 


PT Algorithm 1 


[Pan and Tang, 1999] 


I32j 


0{^{k + l){n-k)) 




PT Algorithm 2 




[32] 


oQ{k + lY{n-k)) 




PT Algorithm 3 




[32] 


0{lj{k + lY{n-k)) 




Pan Algorithm 2 


[Pan, 2000] 


[31] 


0{ljk{n-k) + l) 





Table 1: Deterministic RRQR algorithms for the CSSP. A dash implies that either the authors 
do not provide a running time bound or the algorithm depends exponentially on k. (In addition, 
m > n and / > 1 for this table.) 



accuracy. A typical scenario for these algorithms is that the desired approximation error (see 
e below) is given as input, and then the algorithm selects the minimum number of appropriate 
columns in order to achieve this error. One of the most relevant results for this paper is a bound 
of [11], which states that there exist exactly k columns in any m x n matrix A such that 

ll^-cc+^ll^ < ,/kTT\\A- AkWp. 

Here, C contains exactly k columns of A. The only known algorithm to find these k columns 
is to try all (^) choices and keep the best. This existential result relies on the so-called volume 
sampling method jll^ I12j . In [12], an adaptive sampling method is used to approximate the 
volume sampling method and leads to an 0{mnk + kn) algorithm which finds k columns of A 
such that 

11^ - CC+A\\p < ^{k + 1)! \\A - AuWf ■ 

As mentioned above, much work has also considered algorithms choosing slightly more than k 
columns. This relaxation provides significant flexibility and improved error bounds. For example, 
in [12], an adaptive sampling method leads to an O {mn {k/e^ + fc^logA;)) algorithm, such that 

\\A-CC^A\\p<{l + e) \\A-Ak\\F 

holds with high probability for some matrix C consisting of G (fc/e^ + fc^logA:) columns of A. 
Similarly, in jl5l llGj . Drineas, Mahoney, and Muthukrishnan leverage the subspace sampling 
method to give an 0(min{mn^, m^n}) algorithm such that 

\\A-CC+A\\p<{l + e)\\A-Ak\\F (2) 

holds with high probability if C contains Q{k\ogk/e^) columns of A. 
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3 A two-stage algorithm for the CSSP 



In this section, we present and describe Algorithm [H our main algorithm for approximating the 
solution to the CSSP. This algorithm takes as input an m x n matrix A and a rank parameter 
k. After an initial setup, the algorithm has two stages: a randomized stage and a deterministic 
stage. In the randomized stage, a randomized procedure is run to select G (/clog k) columns from 
the kxn matrix , i.e., the transpose of the matrix containing the top- A; right singular vectors of 
A. The columns are chosen by randomly sampling according to a judiciously-chosen nonuniform 
probability distribution that depends on information in the top-fc right singular subspace of A. 
Then, in the deterministic stage, a deterministic procedure is employed to select exactly k columns 
from the G {k log k) columns chosen in the randomized stage. The algorithm then outputs exactly 
k columns of A that correspond to those columns chosen from Vj^ . Theorem [1] states that the 
projection of A on the subspace spanned by these k columns of A is (up to bounded error) close 
to the best rank k approximation to A. 



3.1 Detailed description of our main algorithm 



In more detail, Algorithm [T] first computes a probability distribution pi,p2, ■ ■ ■ ,Pn over the set 
{1, . . . ,n}, i.e., over the columns of Vj^ , or equivalently over the columns of A. The probability 
distribution depends on information in the top-k right singular subspace of A. In particular, for 
all i G [n], define 



Pi 



iVk] 



>ij) 



+ 







2 


1 

2 










2 



En 

n], and that J27=iPi 



iJ) 



2 ' 



(3) 



1. We will describe the computation of 



and note that pi > 0, for all i G 
probabilities of this form below. 

In the randomized stage. Algorithm [1] employs the following randomized column selection 
algorithm to choose G(A;logA;) columns from VjF to pass to the second stage. Let c assume the 
value of eqn. (jj]). In c independent identically distributed (i.i.d.) trials, the algorithm chooses a 
column of where in each trial the i-th column of VjF is kept with probability pi . Additionally, 
if the i-th column is kept, then a scaling factor equal to l/^/cpi is kept as well. Thus, at the 
end of this process, we will be left with c columns of and their corresponding scaling factors. 
Notice that due to random sampling in i.i.d. trials with replacement we might keep a particular 
column more than once. 

In order to conveniently represent the c selected columns and the associated scaling factors, 
we will use the following sampling matrix formalism. First, define an n x c sampling matrix Si 
as follows: Si is initially empty; at each of the c i.i.d. trials, if the i-th column of V^F is selected 
by the random sampling process, then (an n-vector of all-zeros, except for its i-th entry which 
is set to one) is appended to Si. Next, define the c x c diagonal rescaling matrix Di as follows: 
if the i-th column of is selected, then a diagonal entry of Di is set to l/^/cpi. Thus, we may 
view the randomized stage as outputting the matrix V^SiDi consisting of a small number of 
rescaled columns of , or simply as outputting and Di. 

In the deterministic stage. Algorithm [T] applies a deterministic column selection algorithm to 
the output of the first stage in order to choose exactly k columns from the input matrix A. To do 
so, we run the Algorithm 4 of [23] (with the parameter / set to \/2) on the k x c matrix VjFSiDi, 
i.e., the column-scaled version of the columns of VjF chosen in the first stage. Thus, a matrix 
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VjF S1D1S2 is formed, or equivalently, in the sampling matrix formalism described previously, a 
new matrix 5*2 is constructed. Its dimensions are c x k, since it selects exactly k columns out of 
the c columns returned after the end of the randomized stage. The algorithm then returns the 
corresponding k columns of the original matrix A, i.e., after the second stage of the algorithm is 
complete, the m x k matrix C = AS1S2 is returned as the final output. 



Input: m X n matrix A, integer k. 

Output: m X k matrix C with k columns of A. 



1. Initial setup: 

• Compute the top k right singular vectors of A, denoted by Vk- 

• Compute the sampling probabilities pi, for i G [n], using eqn. ^ or eqn. 

• Let 

c = leOOc^A; log (SOOc^A:) = Q{k log k). (4) 
(Here cq is the unspecified constant of Theorem [2j) 

2. Randomized Stage: 

• For t = l,...,c (i.i.d. trials) select an integer from {l,2,...,n} where the 
probability of selecting i is equal to pi. If i is selected, keep the scaling factor 

• Form the sampling matrix Si and the rescaling matrix Di (see text). 

3. Deterministic Stage: 

• Run Algorithm 4, page 853 of [23J (with the parameter / set to \/2) on the ma- 
trix VjFSiDi in order to select exactly k columns of VjFSiDi, thereby forming 
the sampling matrix 5*2 (see text). 

• Return the corresponding k columns of A, i.e., return C = ASiS2- 



Algorithm 1: A two-stage algorithm for the CSSP. 



3.2 Running time analysis 



We now discuss the running time of our algorithm. Note that manipulating the probability 
distribution of eqn. ([3]) yields: 



Pi 



2k 



+ 



(A) 



\A\\ 



(5) 



Thus, knowledge of Vk, i.e., the n x k matrix consisting of the top-k right singular vectors of 
A, suffices to compute the pj's. By eqn. ([S]), 0(min{mn^, m^n}) time suffices for our theoretical 
analysis. In practice iterative algorithms could be used to speed up the algorithm. Note also that 
in order to obtain the Frobenius norm bound of Theorem [H our theoretical analysis holds if the 
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sampling probabilities are of the form: 



2 



(6) 



That is, the Frobenius norm bound of Theorem [T] holds even if the second term in the sampling 
probabilities of eqns. ^ and ([5]) is omitted. 

Finally, we briefly comment on a technical constraint of Algorithm 4 of [23]. This algorithm 
assumes that its input matrix has at least as many rows as columns. However, in our approach, 
we will apply it on the k x c matrix Vj^SiDi, which clearly has fewer rows than columns. Thus, 
prior to applying the aforementioned algorithm, we first pad V^SiDi with c — k all-zero rows, 
thus making it a square matrix. Let Q = V^SiDi and let be the c x c matrix after the 
padding. Eqn. (8) in Theorem 3.2 of [23] (with i set to k and / set to ^/2) implies that ak{^S2) > 



(Tfc(J7)/ ( y^l + 2k{c — k)]. Clearly, ak{^S2) = (Jk{fiS2) and (Jk{^) = crk{^)- Overall, we get. 



which is the only guarantee that we need in the deterministic step (see Lemma [3]). The running 
time of the deterministic stage of Algorithm [1] is 0(c^ A; log y^) time, since the (padded) matrix 
VjfSiDi has c columns and rows. 

An important open problem would be to identify other suitable importance sampling proba- 
bility distributions that avoid the computation of a basis for the top-A; right singular subspace. 

3.3 Intuition underlying our main algorithm 

Intuitively, we achieve improved bounds for the CSSP because we apply the deterministic algo- 
rithm to a lower dimensional matrix (the matrix V^SiDi with (klogk) columns, as opposed to 
the matrix A with n columns) in which the columns are "spread out" in a "nice" manner. To see 
this, note that the probability distribution of eqn. ([6|), and thus one of the two terms in the proba- 
bility distribution of eqns. ([3]) or ([5|), equals (up to scaling) the diagonal elements of the projection 
matrix onto the span of the top-k right singular subspace. In diagnostic regression analysis, these 
quantities have a natural interpretation in terms of statistical leverage, and thus they have been 
used extensively to identify "outlying" data points [9j. Thus, the importance sampling probabil- 
ities that we employ in the randomized stage of our main algorithm provide a bias toward more 
"outlying" columns, which then provide a "nice" starting point for the deterministic stage of our 
main algorithm. This also provides intuition as to why using importance sampling probabilities 
of the form of eqn. ([6]) leads to relative-error low-rank matrix approximations \15\ 116]. 

4 Proof of Theorem [1] 

In this section, we provide a proof of Theorem [TJ We start with an outline of our proof, pointing 
out conceptual improvements that were necessary in order to obtain improved bounds. An im- 
portant condition in the first phase of the algorithm is that when we sample columns from the 
kxn matrix , we obtain a kx c matrix V^SiDi that does not lose any rank. To do so, we will 
apply a result from matrix perturbation theory to prove that if c = ©(A: log A;) (see eqn. (jl])) then 
(^/T*^!-^!) ~ -'^1 — -'-Z^- (^^^ Lemma [1] below.) Then, under the assumption that VjFSiDi has 
full rank, we will prove that the m x k matrix C returned by the algorithm will satisfy: 






V^l + 2A:(c- A;)' 



A - PcA\\^ < \\A - AkW^ + a^^ (Vf Si^i^s) fc^iZ^i 
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for both ^ = 2,F. (See Lemma [2] below.) Next, we will provide a bound on a^^ 

In order to get a strong accuracy guarantee for the overall algorithm, the deterministic column 

selection algorithm must satisfy 



p{k,c) 



where p{k, c) is a polynomial in both k and c. Thus, for our main theorem, we will employ Al- 
gorithm 4 |23j with / = \/2, which guarantees the above bound with p{k, c) = y^2k (c — k) + 1. 
(See Lemma [3] below.) Finally, we will show, using relatively straightforward matrix perturba- 



tion techniques, that 



is not too much more, in a multiplicative sense, than 

11^ — Afcll^) where we note that the factors differ for ^ = 2, F. (See Lemmas [Hand [5] below.) By 
combining these results, the main theorem will follow. 

4.1 The rank of V^SiD^ 

The following lemma provides a bound on the singular values of the matrix V^SiDi computed 
by the randomized phase of Algorithm [H from which it will follow that the matrix V^SiDi is full 
rank. To prove the lemma, we employ Theorem [5] of the Appendix (this theorem is a variant of a 
result of Rudelson and Vershynin in [34j ) . Note that probabilities of the form of eqn. ([6]) actually 
suffice to establish Lemma [TJ 

Lemma 1 Let Si and Di be constructed using Algorithmic Then, with probability at least 0.9, 

Uk {V^SiDi) > 1/2. 

In particular, SiDi has full rank. 

Proof: In order to bound cr^ (V^^SiDi), we will bound ^Vj^ SiDiDiSiVk — Ik\\2- Towards that 
end, we will use Theorem [2] with /3 = 1/2 and e = 1/20, which results in the value for c in eqn. @. 
Note that the sampling probabilities in eqn. ([5]) satisfy 



Pi > 



2k 

Now Theorem [2] and our construction of 5i and Di guarantee that for c as in eqn. (j4| 

E [ \\V^Vk - V^SiDiDiSfVkW^] < 1/20. 

We note here that the condition Cq ||Vfc||^ > 4/3e^ in Theorem [2] is trivially satisfied assuming that 
Co is at least one (given our choices for /3, e, and ||Vfc||^ = A; > 1). Using V^Vk = Ik and Markov's 
inequality we get that with probability at least 0.9, 

\\v;[SiDiDiS^Vk-Ik\\2 < 10(1/20) = 1/2. 

Standard matrix perturbation theory results [22] now imply that for all i = 1, . . . , /c, 

\aUV^SiDi)-l\<l/2. 
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4.2 Bounding the spectral and Frobenius norms of A — PcA 



Lemma 2 Let Si, Di, and S2 be constructed as described in AlgorithmU\ and recall that C = 
ASiS2- If VjfSiDi has full rank, then for ^ = 2,F, 

\\A-PcA\\^< \\A-Ak\\i: + a^^{V^SiDiS2) \\^p-kV;f_kSiDi\\^ . 

Proof: We seek to bound the spectral and Frobenius norms of A — PqA, where C = AS1S2 
is constructed by Algorithm [TJ To do so, first notice that scaling the columns of a matrix 
(equivalently, post-multiplying the matrix by a diagonal matrix) by any non-zero scale factors 
does not change the subspace spanned by the columns of the matrix. Thus, 



A - PcA = A- {AS1S2) {ASiS2)'^ A 

= A- {AS1D1S2) iASiDiS2)^ A 
= A - (AS) (AS)^ A, 



(7) 



where, in the last line, we have introduced the convenient notation S = S1D1S2 S R"'^^ that we 
will use throughout the remainder of this proof. In the sequel we seek to bound the residual 



First, note that 



1^ - p^^ll^ = 11^ - [AS] {AS)^ A\\^ . 
(^S')+^ = arg min WA-ASXW,. 

X(zRkxn ? 



(8) 



This implies that in eqn. ([8]) we can replace {AS)'^ A with any other kxn matrix and the equality 
with an inequality. In particular we replace {AS)'^ A with [Aj^S)'^ A}^, where A^ is the best rank-A: 
approximation to A: 



\A- AS{AS)+A\ 



Let A 



Up-k^p-kV^_k. Then, A = Ak + Ap_k 



< \\A-ASiAkS)+Ak\\^. 

and, using the triangle inequality, 

\A-PcA\\^ = \\Ak + Ap_k- iAk + Ap^k)SiAkS)+Ak\\^ 

< \\Ak - AkS{AkS)+Ak\\^ + WAp^kW^ + \\Ap_kS{AkS)+Ak\\ 

72 



(9) 



71 



73 



We now bound 71, 72, and 73. First, for 71, note that: 



71 



Ak-AkS{AkS)+Ak\\^ 

Uk^kV^ - Uk^k{V^S){Uk'>^kV^S)+UkT.kV^\\^ 
Uk^kV;[ - UkmVkS){V^S)+{Uk^k)^Uk^kV;[\\^ 

s, - j:k{v^s){v^s)+{Uk^k)^Uk^k\l 



'k 



0. 



(10) 

(11) 

(12) 



In eqn. ^T0\) . we replaced (C/fcSjtV^-^S)"'" by {V/^ S)~^ (Uk'^k)^ ■ This follows since the statement of 
our lemma assumes that the matrix Vj[SiDi has full rank. Also, the construction of ^2 guarantees 
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that the columns of SiDi that are selected in the second stage of Algorithm [T] are linearly 
independent, and thus the k x k matrix Vj^S = S1D1S2 has full rank and is invertible. In 
eqn. can be dropped without increasing a unitarily invariant norm, while eqn. (|12p 

1^ 



follows since is a full-rank kxk matrix. Next, note that 72 
to conclude the proof, we bound 73 as follows: 



p-fcll^ = \\A - AkWi:. Finally, 



73 = \\Ap^kS{AkS)+Ak\\^ 
= \\^p^kV^_kS{V^S)+\i 



< 



(13) 
(14) 
(15) 



Eqn. (|13p follows by the orthogonality of Up-k and Vk and the fact that V^S is a. k x k invertible 
matrix (see above). Eqn. (jl4p follows from the fact that for any two matrices X and Y and 

e = 2,F 

orthogonal matrix 



\XY\\^ < \\X\ 



Y\\2. Finally, eqn. follows since S = S1DS2 and ^2 is an 



4.3 Upper bounds for a,^ {Vj[SiDiS2) and ^ = 2,F 

Lemma 3 Let Si, Di, and S2 be constructed using Algorithm [IJ Then, with probability at 
least 0.9, 

a^^ (Vf 5iDi52) < {c-k) + l. 

Proof: From Lemma[T]we know that fij (Vj^ SiDi^ > 1/2 holds for all i = 1, . . . , A: with probability 
at least 0.9. The deterministic construction of S2 (see Algorithm 4 of (23] with the parameter / 
set to \/2) guarantees that 



akiV^SiDiS2) > 



y/2k{c-k) + 1 ~ 2^/2k{c-k) + 1 



> 



Lemma 4 ((^ = 2) If Si and Di are constructed as described in AlgorithmUl then, with proba- 
bility at least 0.9, 



\j:p-kVl!ikSiDi\\^ < \\A-Ak\\2 + -^\\A-Ak\\p 



Proof: Let r 



^p-kV^-k^p-k^p-k 



^p-fc- manipulate 



^p-kVpUSiDi 



as follows: 



\T.p_kV^_kSiDi\\2 



< 
< 



\^p^kV;!ikSiDiDiSlVp^k^p^k - r + r||2 

l^p-kyff-k^lDlDl^T^p-k^p-k — FII2 + ||llp_fe||2 
l^p-kyp-^k^l^l^l^l^p-k^p-k — '^\\f+ \\^f)-k\\2 



Given our construction of Si and Di and applying eqn. (9) of Theorem 1 of |13 ] with (3 = 1/2 
and 5 = 0.1, we get that with probability at least 0.9, 



l^p-kVp-kSlDlDlSiVp^k'^p-k - '^p-kyff-k^p-k^p-kWp 
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T ||2 
p-k\\p 
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Thus, by combining the above results and using 



\A-A 



k\\2 



and 



\A - AkWp we get 



12 

\ll,-kVj_kSiDi\\^ <j=JA- AkWl + U - AkWl ■ 



To conclude the proof of the lemma we take the square roots of both sides of the above inequality. 

Lemma 5 = F) If Si and Di are constructed as described in Algorithm[I\ then, with proba- 
bility at least 0.9, 

Proof: It is straightforward to prove that with our construction of 5*1 and Di, the expectation of 



^p-kVp-k^iDi 



is equal to 



^P-kV^-k 



In addition, note that the latter quantity is exactly 



equal to \\A — Ak\\p. Applying Markov's inequality, we get that, with probability at least 0.9, 



\\^p-kVp'_i,SiDi\\p < 10 \\A - ylfcll^ . 
Taking square roots of both sides of the above inequality concludes the proof of the lemma. 



4.4 Completing the proof of Theorem [T] 

To prove the Frobenius norm bound of Theorem [1] we combine Lemma [2] (with = F) with 
Lemmas E] and O Thus, we get 



\A-PcA\\p < \\A-Ak\\p+[2y/2k (c-fe) + l {A\\A-Ak\\F 



l + 8y/2k{c- k) + 1 \\A-Ak 



\F ■ 

Using c = @ {k log k) immediately derives the Frobenius norm bound of Theorem [TJ Notice that 
Lemma [3] fails with probability at most 0.1 and that Lemma fails with probability at most 0.1; 
thus, applying the standard union bound, it follows that the Frobenius norm bound of Theorem [1] 
holds with probability at least 0.8. To prove the spectral norm bound of Theorem [T] we combine 
Lemma [2] (with ^ = 2) with Lemmas [3] and [H Thus, we get 



PcA\\^ < \\A-Ak\\^+[2y/2k{c-k) + l) ( \\A-Ak\\2 + — \\A-Ak\\F 



r-r-, \ „ . . „ '8j2k(c-k) + 1 „ , 

\^2^j2k (c- A;) + 1) p-^fc||2 + -^^ Vt^^ 11^-^ 



ol/4 



k\\p ■ 



Using c = (A; log fe) immediately derives the spectral norm bound of Theorem [TJ Notice that 
Lemma [3] fails with probability at most 0.1 and that Lemma H] fails with probability at most 0.1; 
thus, applying the standard union bound, it follows that the spectral norm bound of Theorem [T] 
holds with probability at least 0.8. 
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Appendix 

Let A £ be any matrix. Consider the following algorithm (which is essentially the algorithm 

in page 876 of [16]) that constructs a matrix C G M™'^'^ consisting of c rescaled columns of A. We 
state Theorem 4 of [17] that provides a bound for the approximation error H^^"^ — CC"^|L. 



Data : A G M™^", pi>0,i£ [n] s.t. Ylie[n]Pi — 1' positive integer c < n. 
Result : C G M™^^ 

Initialize S G M™"^^ to be an all-zero matrix, 
for t = 1, . . . , c do 



Theorem 2 Let A G M™^" with \\A\\2 < 1. Construct C using the Exactly(c) algorithm and 
let the sampling probabilities pi satisfy 




end 



Return C = AS; 



Algorithm 2: The Exactly(c) algorithm. 




(16) 
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for all i G [n] for some constant /3 G (0,1]. Let e G (0,1) he an accuracy parameter, assume 
cl \\A\\], > 4/3e2, and let 




(Here cq is the unknown constant of Theorem 3.1, p. 8 of 134^ .) Then. 

E[||AA^-CC^||J <e. 
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